Modern Data Analytics¶

Project¶

KU Leuven - Angelo Patane' r0793881¶

Table of Contents¶

1. Introduction¶

2. Exploratory Data Analysis¶

3. Analysis¶

Introduction¶

Introduction - Cleaned Dataset 1, CO2 Emission From Fuel Combustion 2020 Edition.¶

In [2]:
df_final
Out[2]:
Country Year CO2 emissions Population Continent
0 World 1971 67.974998 71.190002 NaN
1 World 1972 71.264000 72.639999 NaN
2 World 1973 75.351997 74.102997 NaN
3 World 1974 75.197998 75.533997 NaN
4 World 1975 75.483002 76.933998 NaN
... ... ... ... ... ...
8587 Oceania 2014 143.311996 137.014008 NaN
8588 Oceania 2015 145.908997 139.089005 NaN
8589 Oceania 2016 148.843002 141.373993 NaN
8590 Oceania 2017 150.341995 143.796005 NaN
8591 Oceania 2018 149.811005 146.080994 NaN

8592 rows × 5 columns

Introduction - Cleaned Dataset 2, The World Bank.¶

In [3]:
df_merge
Out[3]:
Country Year CO2 emissions Population Continent Country Code Population growth (annual %)
0 World 1971 67.974998 71.190002 NaN WLD 2.133117
1 World 1972 71.264000 72.639999 NaN WLD 2.031211
2 World 1973 75.351997 74.102997 NaN WLD 1.982943
3 World 1974 75.197998 75.533997 NaN WLD 1.929549
4 World 1975 75.483002 76.933998 NaN WLD 1.855834
... ... ... ... ... ... ... ...
6079 United Arab Emirates 2014 340.184998 504.048004 Asia ARE 0.176775
6080 United Arab Emirates 2015 359.829010 506.729004 Asia ARE 0.527292
6081 United Arab Emirates 2016 370.427002 512.090027 Asia ARE 1.053271
6082 United Arab Emirates 2017 386.880005 518.981995 Asia ARE 1.339470
6083 United Arab Emirates 2018 371.341003 526.859985 Asia ARE 1.503938

6084 rows × 7 columns
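The merged table above combines the two cleaned datasets country by country and year by year, but the join itself is not shown. A minimal sketch with toy rows, assuming an inner join on the shared `Country` and `Year` keys (the key choice and join type are assumptions, not taken from the project code), could look like:

```python
import pandas as pd

# Toy stand-ins for the two cleaned datasets (column names from the tables above)
df_final = pd.DataFrame({
    "Country": ["World", "World"],
    "Year": [1971, 1972],
    "CO2 emissions": [67.974998, 71.264000],
    "Population": [71.190002, 72.639999],
    "Continent": [None, None],
})
df_wb = pd.DataFrame({
    "Country": ["World", "World"],
    "Year": [1971, 1972],
    "Country Code": ["WLD", "WLD"],
    "Population growth (annual %)": [2.133117, 2.031211],
})

# Inner join on the shared keys keeps only country-year pairs present in both
df_merge = df_final.merge(df_wb, on=["Country", "Year"], how="inner")
```

An inner join would also explain why the merged frame has fewer rows than Dataset 1: country-year pairs without a World Bank match are dropped.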

Exploratory Data Analysis¶

In [5]:
fig1.show()
In [7]:
fig2.show()
In [9]:
fig3.show()
In [11]:
fig4.show()
In [15]:
fig5.show()

Analysis - Year 2018¶

Pearson Correlation Coefficient $r$¶

It measures the strength of the linear association between two continuous variables.¶
\begin{equation} r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}, \end{equation}

with $\bar{x}$ and $\bar{y}$ the sample means.

Moreover, given the sample correlation coefficient $r$, the following hypothesis test is performed $$ H_0: \rho = 0, \;\; H_1: \rho \neq 0,$$ with $\rho$ the true (population) correlation coefficient.
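The formula above can be checked numerically. A minimal sketch on synthetic data (the variables below are illustrative, not the project's), comparing the hand-computed $r$ against NumPy's `corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)  # positively correlated by construction

# Sample correlation coefficient, computed directly from the formula
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# Library cross-check: off-diagonal entry of the 2x2 correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]
```

In the notebook itself, `scipy.stats.pearsonr` is used instead, which additionally returns the two-sided $p$-value for the test of $H_0: \rho = 0$.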

In [18]:
stats.pearsonr(df_merge_2018_1['CO2 emissions'], df_merge_2018_1['Population growth (annual %)'])
Out[18]:
(0.48855680323340167, 1.7316896372956907e-08)

$r = 0.49 \Rightarrow$ moderate linear relationship $(0.3 \leq |r| \leq 0.5)$;

$p\text{-value} = 1.73\text{e}-8 \Rightarrow$ very strong evidence to reject the null hypothesis $H_0: \rho = 0$.

Cluster Analysis (K-Means Clustering)¶

It finds groups, or clusters, in a data set such that observations within the same cluster are similar to each other with respect to the chosen features.¶

Let $K$ be the desired number of clusters and $C_1,\ldots, C_K$ the sets of indices of the observations in each cluster, such that $\bigcup_{i=1}^K C_i = \{1, \ldots, n\}$ and $C_k\; \cap \;C_{k^{'}} = \emptyset,\; \forall k \neq k^{'}$, with $n$ denoting the total number of observations.

Let $W(C_k)$ be the within-cluster variation defined using the squared Euclidean distance. That is, $$ W(C_k) = \frac{1}{|C_k|}\sum_{i, i^{'} \in \; C_k}\;\sum_{j = 1}^{p}(x_{ij} - x_{i^{'}j})^2,$$ with $|C_k|$ indicating the number of measurements in the $k$th cluster, and $p$ the number of features.
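Since $W(C_k)$ averages all pairwise squared distances within the cluster, it equals twice the sum of squared distances from the observations to the cluster centroid, which is the quantity K-means implementations typically minimise. A small numerical check of this identity on synthetic data (not the project's data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))  # one cluster of 6 observations, p = 2 features

# Direct evaluation of W(C_k): sum of all pairwise squared distances,
# divided by the cluster size |C_k|
n = len(X)
pairwise = sum(np.sum((X[i] - X[j]) ** 2) for i in range(n) for j in range(n))
W_direct = pairwise / n

# Equivalent form: twice the sum of squared distances to the centroid
mu = X.mean(axis=0)
W_centroid = 2 * np.sum((X - mu) ** 2)
```

Both expressions give the same value, so minimising distances to the centroids minimises $W(C_k)$ as defined above.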

The objective of $K$-means clustering is to find a partition such that the total within-cluster variation is as small as possible. That is, \begin{equation} \label{eq:kMeansObjective} \min_{C_1, \ldots, C_K} \sum_{k=1}^{K} \frac{1}{|C_k|}\sum_{i, i^{'} \in \; C_k}\;\sum_{j = 1}^{p}(x_{ij} - x_{i^{'}j})^2 . \end{equation}

The algorithm converges to a local optimum rather than the global one $\Rightarrow$ it is run several times (here, 10) from different random initialisations, and the solution with the smallest total within-cluster variation is kept.

The two features 'CO2 emissions' and 'Population growth (annual %)' are measured on different scales $\Rightarrow$ both are scaled to have standard deviation 1 so that they contribute equally to the distance computations.
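The procedure described above (scale the features, run K-means from several random starts, keep the best local optimum) can be sketched with a minimal NumPy implementation of Lloyd's algorithm on synthetic data. The function and data below are illustrative only; the notebook presumably uses a library implementation such as scikit-learn's `KMeans`:

```python
import numpy as np

def kmeans(X, K, n_init=10, n_iter=100, seed=0):
    """Minimal K-means (Lloyd's algorithm) with random restarts.

    Runs n_init times from different random centroid initialisations and
    keeps the run with the lowest total within-cluster sum of squares,
    since a single run only reaches a local optimum.
    """
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), K, replace=False)]
        for _ in range(n_iter):
            # Assign each observation to its nearest centroid
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Recompute each centroid as the mean of its assigned points
            new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        inertia = ((X - centers[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_inertia, best_labels = inertia, labels
    return best_labels, best_inertia

# Two well-separated synthetic clusters in two features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])

# Scale each feature to standard deviation 1, as done in the analysis
X_scaled = X / X.std(axis=0)

labels, inertia = kmeans(X_scaled, K=2)
```

With well-separated groups like these, every restart recovers the same two clusters; on real data the restarts can disagree, which is exactly why the best of several runs is kept.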

In [22]:
fig7.show()
In [25]:
fig8.show()
In [27]:
fig9.show()